Duplicate Web Pages Detection with the Support of 2d Table Approach

Author

  • JAYA KUMAR
Abstract

Duplicate and near-duplicate web pages hinder search engines. Their most common consequence is growth in the number of indexed pages that must be stored; the larger index slows processing and raises serving cost. Duplicates also surface when data is gathered from various sources in response to a user's query, which slows information retrieval. Duplication means similar content or documents located on different sites. It can occur in different forms and at different levels: exact document copy, paragraph copy, sentence copy, single-word changes, and sentence-structure changes. Duplicate detection is the process of identifying multiple representations of the same real-world object. In this paper, content duplication is identified using a two-dimensional (2D) text matrix approach. With the proposed 2D matrix approach, the system detected duplicate web pages with a precision of 92%, indicating that the 2D technique performs well for duplicate web page detection.

Keywords: Near Duplicate Detection, 2D Approach, Information Retrieval, Content Duplicate.
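The abstract does not describe how the 2D text matrix is built, so the following is only an illustrative sketch of one possible reading, not the author's method. It assumes that rows are the sentences of one page, columns are the sentences of another, and each cell holds the Jaccard word overlap of the two sentences. The function names, the sentence splitter, and the 0.8 cell threshold are all hypothetical choices made for this example.

```python
import re


def sentences(text):
    """Split text into sentences and return one lowercase word set per sentence."""
    parts = re.split(r"[.!?]+", text.lower())
    word_sets = (set(re.findall(r"[a-z0-9]+", part)) for part in parts)
    return [ws for ws in word_sets if ws]  # drop empty fragments


def jaccard(a, b):
    """Jaccard overlap of two non-empty word sets, in [0, 1]."""
    return len(a & b) / len(a | b)


def similarity_table(page_a, page_b):
    """Build the 2D table: rows are sentences of page_a, columns those of page_b."""
    rows = sentences(page_a)
    cols = sentences(page_b)
    return [[jaccard(r, c) for c in cols] for r in rows]


def duplicate_score(page_a, page_b, cell_threshold=0.8):
    """Fraction of page_a sentences with a close match somewhere in page_b.

    cell_threshold is an assumed value, not one taken from the paper.
    """
    table = similarity_table(page_a, page_b)
    if not table or not table[0]:
        return 0.0
    matched = sum(1 for row in table if max(row) >= cell_threshold)
    return matched / len(table)


if __name__ == "__main__":
    original = "Search engines index pages. Duplicates waste storage."
    reordered = "Duplicates waste storage. Search engines index pages."
    print(duplicate_score(original, reordered))
```

Working at sentence granularity lets the same table cover several of the duplication levels listed in the abstract: a reordered or copied sentence still finds a full match in some column, while a single-word change lowers one cell without zeroing it.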


Similar resources

An Efficient Approach for Near-duplicate page detection in web crawling

The drastic development of the World Wide Web in the recent times has made the concept of Web Crawling receive remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to the web search engines making their results less relevant to the users. The presence of duplicate and near duplicate web documents in abundance has created additional overhea...


Analyzing new features of infected web content in detection of malicious web pages

Recent improvements in web standards and technologies enable the attackers to hide and obfuscate infectious codes with new methods and thus escaping the security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...


Anomaly Detection on the Web by Building Access Usage Profiles

Due to the increase in cyber-attacks, techniques for detecting attacks on web servers have drawn attention. Unfortunately, many available security solutions are inefficient in identifying web-based attacks. The main aim of this study is to detect abnormal web navigations based on web usage profiles. In this paper, comparing scrolling behavior of a normal user with an attacker, and simu...


A Near-duplicate Detection Algorithm to Facilitate Document Clustering

Web Mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near duplicates is very difficult in a large collection of data like the "internet". The presence of these web pages plays an important role in the performance degradation while integrating data from heterogeneous sources. These pages either increase the index storage space or increase the serving costs. Detecting t...


A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines

Duplication of Web pages greatly hurts the perceived relevance of a search engine. Existing methods for detecting duplicated Web pages can be classified into two categories, i.e. offline and online methods. The offline methods target to detect all duplicates in a large set of Web pages, but none of the reported methods is capable of processing more than 30 million Web pages, which is about 1% o...




Publication date: 2014